ML Course Part 1 - Introduction to Machine Learning
1 Introduction
1.0.1 Definition
Machine Learning (ML)
Feeding data into a computer algorithm in order to learn patterns and make predictions in new and different situations.
ML Model
Computer object implementing a ML algorithm, trained on a set of data to perform a given task.
2 Definitions
2.1 Categories of ML
2.1.1 Type of dataset
- Supervised: for each input in the dataset, the expected output is also part of the dataset
- Unsupervised: for each input in the dataset, the expected output is not part of the dataset
- Semi-supervised: only a portion of the inputs of the dataset have their expected output in the dataset
- Reinforcement: there is no predefined dataset, but an environment giving feedback to the model when it takes actions
2.1.2 Type of output
- Classification: assigning one (or multiple) label(s) chosen from a given list of classes to each element of the input
- Regression: assigning one (or multiple) value(s) chosen from a continuous set of values to each element of the input
- Clustering: creating categories by grouping together similar inputs
2.2 Dataset
2.2.1 Definition
Dataset
A collection of data used to train, validate and test ML models.
2.2.2 Content
Instance (or sample)
An instance is one individual entry of the dataset.
Feature (or attribute or variable)
A feature is a type of information stored in the dataset about each instance.
Label (or target or output or class)
A label is a piece of information that the model must learn to predict.
2.2.3 Subsets
Dataset subsets
An ML dataset is usually subdivided into three disjoint subsets, each with a distinct role in the training process:
- Training set: used during training to fit the parameters of the model,
- Validation set: used during training to assess the generalization capability of the model, tune hyperparameters and prevent overfitting,
- Test set: used after training to evaluate the performance of the model on new data it has not encountered before.
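As a minimal sketch of such a split, the snippet below shuffles the instance indices once and cuts them into three disjoint blocks. The toy data and the 70/15/15 proportions are assumptions chosen for illustration, not a recommendation from the course.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

n_samples = 100
X = rng.normal(size=(n_samples, 3))      # toy features (assumed data)
y = rng.integers(0, 2, size=n_samples)   # toy binary labels (assumed data)

# Shuffle the indices once, then cut them into three disjoint blocks.
indices = rng.permutation(n_samples)
n_train = n_samples * 70 // 100
n_val = n_samples * 15 // 100

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(train_idx), len(val_idx), len(test_idx))
```

The test block is only touched once, after all training and tuning decisions have been made.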
3 Overview of ML methods
3.1 Supervised Learning
3.1.1 Linear Regression
Used for predicting continuous values:
- Simple: \(y = \alpha + \beta x\)
- Multiple: \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\)
- Polynomial: \(y = \alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n\)
- and many others…
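The polynomial case can be sketched with plain NumPy: it is a linear regression on the powers of \(x\), solved here by least squares. The synthetic data and its coefficients (2, 3, -0.5) are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data following y = 2 + 3x - 0.5x^2 plus a little noise (assumed).
x = rng.uniform(-2, 2, size=200)
y = 2.0 + 3.0 * x - 0.5 * x**2 + rng.normal(scale=0.1, size=x.shape)

# Design matrix with columns [1, x, x^2]: polynomial regression is a
# linear regression on the powers of x.
A = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares solution of A @ coeffs ~ y.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
alpha, beta_1, beta_2 = coeffs
print("alpha:", alpha, "beta_1:", beta_1, "beta_2:", beta_2)
```

The estimated coefficients should land close to the values used to generate the data.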
3.1.2 Logistic Regression
Used for classification problems:
- Binomial: only two possible categories
- Multinomial: three or more possible categories
- Ordinal: three or more possible categories which are ordered
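For the binomial case, a minimal sketch is a sigmoid applied to a linear score, fitted by gradient descent on the log-loss. The two Gaussian blobs, the learning rate and the number of iterations are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs as a toy binary problem (assumed data).
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(+2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

weights = np.zeros(X.shape[1])
bias = 0.0
learning_rate = 0.1

for _ in range(500):
    p = sigmoid(X @ weights + bias)      # predicted P(class = 1)
    grad_w = X.T @ (p - y) / len(y)      # gradient of the mean log-loss
    grad_b = np.mean(p - y)
    weights -= learning_rate * grad_w
    bias -= learning_rate * grad_b

predictions = (sigmoid(X @ weights + bias) >= 0.5).astype(float)
accuracy = np.mean(predictions == y)
print("Training accuracy:", accuracy)
```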
3.1.3 Decision Trees
A tree-like structure used for both classification and regression
3.1.4 Random Forests
An ensemble method that combines multiple decision trees:
- Train \(B\) trees independently using:
  - Bagging: each tree is fitted on a random subset of the training set
  - Feature bagging: each split in the decision tree (i.e. each node) is chosen among a subset of the features
- Take a decision by aggregating individual decisions of each tree
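The steps above can be sketched by hand on top of scikit-learn's `DecisionTreeClassifier`. This is an illustrative sketch rather than how `RandomForestClassifier` is implemented internally; the number of trees (25) and the use of the iris dataset are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
n_samples = X.shape[0]
n_trees = 25  # B in the text (assumed value)

trees = []
for b in range(n_trees):
    # Bagging: draw n_samples indices with replacement (bootstrap sample).
    idx = rng.integers(0, n_samples, size=n_samples)
    # Feature bagging: max_features="sqrt" makes each split consider only
    # a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate by majority vote over the individual tree predictions.
classes = np.unique(y)
all_preds = np.stack([tree.predict(X) for tree in trees])          # (n_trees, n_samples)
votes = np.stack([(all_preds == c).sum(axis=0) for c in classes])  # (n_classes, n_samples)
forest_pred = classes[votes.argmax(axis=0)]

forest_accuracy = np.mean(forest_pred == y)
print("Training accuracy:", forest_accuracy)
```

The accuracy here is measured on the training data only, so it says nothing about generalization; it merely checks that the aggregation works.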
3.1.5 Boosting
An ensemble method that combines weak learners (usually decision trees) to form a stronger model:
- Choose a simple base learner (e.g. small decision trees with fixed number of leaves)
- Repeatedly:
  - Train a new base learner on the weighted training set
  - Add this new learner to the ensemble
  - Evaluate the performance of the ensemble
  - Give more weight in the training set to misclassified data
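One concrete instance of this loop is AdaBoost with depth-1 trees (stumps). The sketch below is a simplified version written for illustration; the toy data, the number of rounds and the clipping of the error are assumptions, and other boosting variants (e.g. gradient boosting) update the ensemble differently.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy problem with a diagonal boundary that a single axis-aligned stump
# cannot represent (assumed data). Labels are in {-1, +1}.
X = rng.uniform(-1, 1, size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

n_rounds = 50
weights = np.full(len(y), 1.0 / len(y))  # start with uniform sample weights
learners, alphas = [], []

for _ in range(n_rounds):
    # Train a new base learner (a depth-1 tree) on the weighted training set.
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this learner and its weight in the ensemble.
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    learners.append(stump)
    alphas.append(alpha)

    # Give more weight to misclassified samples, then renormalise.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

# Ensemble decision: sign of the weighted sum of the base learners.
scores = sum(a * learner.predict(X) for a, learner in zip(alphas, learners))
ensemble_pred = np.where(scores >= 0, 1, -1)

first_stump_acc = np.mean(learners[0].predict(X) == y)
ensemble_acc = np.mean(ensemble_pred == y)
print("First stump:", first_stump_acc, "Ensemble:", ensemble_acc)
```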
3.1.6 K-Nearest Neighbors (KNN)
A non-parametric method for classification and regression
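A minimal classification sketch: each query point receives the majority label among its \(k\) closest training points. The helper name `knn_predict`, the Euclidean distance and the tiny dataset are assumptions chosen for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query point by majority vote among its k nearest
    training points (Euclidean distance)."""
    # Pairwise distances, shape (n_query, n_train).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each query.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote over the neighbours' labels (non-negative integers).
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

# Tiny hand-made dataset (assumed for illustration).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.5, 0.5], [5.5, 5.5]])

print(knn_predict(X_train, y_train, X_query, k=3))  # [0 1]
```

There is no training step: the "model" is the stored training set itself, which is what makes the method non-parametric.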
3.1.7 Naive Bayes
A probabilistic classifier based on Bayes’ theorem.
3.1.8 Support Vector Machines (SVM)
Used for classification and regression, effective in high-dimensional spaces:
- Separate the feature space using optimal hyperplanes
- Features can be mapped into a higher-dimensional space, so that a hyperplane in that space corresponds to a non-linear boundary in the original feature space
3.2 Unsupervised Learning
3.2.1 K-Means Clustering
A method for partitioning data into \(k\) clusters:
- \(k\) must be chosen a priori
- The principle is to start with random centroids and then iteratively:
  - Assign each point to the cluster of its closest centroid
  - Move each centroid to the mean of the points assigned to it
- Classical method with many variants
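The iteration described above can be sketched in a few lines of NumPy. The function name `k_means`, the stopping rule and the handling of empty clusters are assumptions of this sketch, and the two blobs are made-up data.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
# Two well-separated blobs around (0, 0) and (5, 5) (assumed data).
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = k_means(X, k=2)
print(centroids)
```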
3.2.2 Hierarchical Clustering
Builds a hierarchy of clusters using either agglomerative or divisive methods:
- Build a full hierarchy top-down (divisive) or bottom-up (agglomerative)
- Create any number of clusters by cutting the tree
3.2.3 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Clustering based on the density of data points:
- Divides points into 4 categories: core points, directly reachable points, reachable points and outliers
- Only two parameters: neighborhood radius (\(\epsilon\)) and minimum number of neighbors for a point to be core (\(min_{pts}\))
3.2.4 Principal Component Analysis (PCA)
Dimensionality reduction technique to project data into lower dimensions:
- Project data into a space of lower dimension
- Keep as much variance (so as much information) as possible
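A minimal PCA sketch using the singular value decomposition of the centred data. The function name `pca` and the synthetic 3D data are assumptions for illustration; library implementations offer more options.

```python
import numpy as np

def pca(X, n_components):
    # Centre the data: PCA looks at the variance around the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centred data; the rows of Vt are the principal directions,
    # ordered by decreasing singular value (i.e. decreasing variance).
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance_ratio = (S**2 / np.sum(S**2))[:n_components]
    return X_centered @ components.T, components, explained_variance_ratio

rng = np.random.default_rng(0)
# 3D points that mostly vary along a single direction (assumed data).
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + rng.normal(scale=0.05, size=(200, 3))

X_reduced, components, ratio = pca(X, n_components=1)
print("Reduced shape:", X_reduced.shape)
print("Share of variance kept:", ratio)
```

Here a single component keeps almost all the variance, because the data was built to lie close to a line.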
3.2.5 t-Distributed Stochastic Neighbor Embedding (t-SNE)
A nonlinear dimensionality reduction technique primarily used for visualization of high-dimensional data
3.3 Reinforcement Learning
3.3.1 Definitions
Policy (in RL)
Function that returns an action given the state of the environment.
On-Policy vs. Off-Policy
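Refers to the policy that is used to update the knowledge during training. On-Policy methods update with the action actually chosen by the policy being learnt (including its exploratory moves), while Off-Policy methods update as if a different policy were followed, typically the greedy one that always picks the best-known action.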
3.3.3 Q-Learning / SARSA - Similarities
A value-based reinforcement learning algorithm:
- Iteratively learn the expected cumulative reward of each action in each state
- Limited to discrete and simple environments
- Neural network variants (like Deep Q-Learning) can handle more complex environments
3.3.4 Q-Learning / SARSA - Differences
The two methods differ in the estimate of future value that is used to update the Q-table:
- Q-Learning (Off-Policy) uses the best possible value in the next state: \(Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma \max_{a''} Q(s',a'') - Q(s,a))\)
- SARSA (On-Policy) uses the value of the action actually taken next: \(Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma Q(s',a') - Q(s,a))\)
where:
- \(a\) is the action taken while the environment is in state \(s\), leading to state \(s'\) where action \(a'\) will be taken
- \(\alpha\) is the learning rate
- \(r\) is the reward received after taking action \(a\) in state \(s\) and arriving in state \(s'\)
- \(\gamma\) is the discount factor defining how much we value long-term rewards relative to short-term rewards
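To make the difference concrete, here is a sketch of both update rules on a small Q-table stored as a NumPy array (rows are states, columns are actions). The Q-learning target takes the maximum over the next state's actions, while the SARSA target uses the action \(a'\) actually taken. The table values, the function names and the default \(\alpha\) and \(\gamma\) are hypothetical.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy target: best value available in the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: value of the action actually taken next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

# Hypothetical Q-table with 2 states and 2 actions.
Q_init = np.array([[0.0, 0.0],
                   [1.0, 5.0]])

Q_ql = Q_init.copy()
q_learning_update(Q_ql, s=0, a=0, r=1.0, s_next=1)

Q_sarsa = Q_init.copy()
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0)

print(Q_ql[0, 0])     # about 0.1 * (1 + 0.9 * 5) = 0.55
print(Q_sarsa[0, 0])  # about 0.1 * (1 + 0.9 * 1) = 0.19
```

With the same transition, Q-learning is more optimistic because it assumes the best action will follow, whereas SARSA accounts for the (possibly exploratory) action that was really chosen.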
3.3.5 Policy Gradient Methods
Given a model defined with a certain number of parameters, optimize the policy by adjusting these parameters in the direction that improves performance (e.g., REINFORCE algorithm)
3.3.6 Actor-Critic Methods
Combination of an Actor and a Critic learning simultaneously:
- The Actor learns the policy through a parametrized function
- The Critic estimates the value of each action and gives feedback to the Actor
4 Usual pipeline
4.0.1 Overview
- Data acquisition
- Data preprocessing
- Model selection
- Model evaluation
- Final model training
4.0.2 Data acquisition
Gather the data, potentially from multiple different sources. Choosing the right sources can also depend on the choices made in the next steps.
4.1 Data preprocessing
4.1.1 Different issues
Multiple sources of issues and steps to perform:
- Handle different formats
- Remove outliers (mostly for raw data)
- (Optionally) extract features
- Handle missing data
- Normalize
4.1.2 Why normalization?
Idea
A priori all features have the same importance, so none of them should have an advantage. Therefore, having features with larger values than others would be detrimental.
Usually, all features are individually normalized over the whole dataset, to obtain a distribution with an average of 0 and a standard deviation of 1:
\[ \begin{align*} \hat{X} & = \frac{1}{n+1} \sum\limits_{j=0}^n X_j \\ \sigma_X & = \sqrt{\frac{1}{n+1} \sum\limits_{j=0}^n (X_j - \hat{X})^2} \\ \forall k \in \{0, \dots, n\}, \quad X_k & \leftarrow \frac{X_k - \hat{X}}{\sigma_X} \end{align*} \]
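In NumPy, this per-feature normalization is a short sketch: subtract the average of each column and divide by its standard deviation. The toy matrix below (rows are instances, columns are features) is made up for illustration.

```python
import numpy as np

# Toy dataset: rows are instances, columns are features on very different
# scales (assumed data).
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0],
              [4.0, 4000.0]])

mean = X.mean(axis=0)  # per-feature average
std = X.std(axis=0)    # per-feature standard deviation
X_normalized = (X - mean) / std

print(X_normalized.mean(axis=0))  # close to [0, 0]
print(X_normalized.std(axis=0))   # close to [1, 1]
```

After normalization both columns hold the same values, although the raw features differed by a factor of 1000.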
4.1.3 Model selection
- Type of model (ML, NN, DL, …)
- Complexity:
- Number of features
- Type of output
- Size of the layers (for NN)
- Number of layers (for NN)
- Hyperparameters
4.1.4 Model training
- Loss selection: depends on the task, the objectives, the specific issues to solve
- Training process selection (lots of different tweaks and improvements can be implemented in NN training)
- Hyperparameter tuning, by repeatedly:
  - Selecting one or multiple configurations of hyperparameters
  - Training the model one or multiple times
  - Determining the best hyperparameters
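The tuning loop can be sketched with a single hyperparameter. The example below uses ridge regression, which is not covered in this course and is chosen here only because it has a closed-form solution; its regularization strength `lam`, the candidate values and the toy data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed): 10 features, only the first 3 are useful.
X = rng.normal(size=(120, 10))
true_w = np.array([3.0, -2.0, 1.0] + [0.0] * 7)
y = X @ true_w + rng.normal(scale=0.5, size=120)

X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

def fit_ridge(X, y, lam):
    # Closed-form ridge regression: (X^T X + lam * I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

# Candidate configurations of the hyperparameter.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

results = {}
for lam in candidates:
    w = fit_ridge(X_train, y_train, lam)   # train with this configuration
    results[lam] = mse(X_val, y_val, w)    # score it on the validation set

best_lam = min(results, key=results.get)   # keep the best configuration
print("Validation MSE per lam:", results)
print("Best lam:", best_lam)
```

The selection uses the validation set only, so the test set remains untouched for the final evaluation.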
4.1.5 Model evaluation - Criteria
Criteria selection among the many possible ones:
- For classification:
  - Accuracy: for balanced datasets
  - Precision: when false positives are costly
  - Recall: when false negatives are costly
  - F1-Score: when class distribution is unbalanced
  - …
- For regression:
  - Mean Absolute Error (MAE)
  - Mean Square Error (MSE): more sensitive to large errors than MAE
  - …
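These criteria follow directly from the counts of true positives, false positives and false negatives, as the sketch below shows for a binary task. The label and value arrays are hypothetical.

```python
import numpy as np

# Classification: hypothetical true and predicted binary labels.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

accuracy = np.mean(y_pred == y_true)                 # 5 / 8
precision = tp / (tp + fp)                           # 3 / 5
recall = tp / (tp + fn)                              # 3 / 4
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
print(accuracy, precision, recall, f1)

# Regression: hypothetical true and predicted values, with one large error.
t_true = np.array([1.0, 2.0, 3.0, 4.0])
t_pred = np.array([1.0, 2.0, 3.0, 8.0])

mae = np.mean(np.abs(t_pred - t_true))  # 4 / 4 = 1.0
mse = np.mean((t_pred - t_true) ** 2)   # 16 / 4 = 4.0
print(mae, mse)
```

The single error of 4 contributes 1.0 to the MAE but 4.0 to the MSE, which illustrates why the MSE is more sensitive to large errors.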
4.1.6 Model evaluation - Cross-validation
Cross-validation
Method to estimate real performance of the model by:
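- Splitting the dataset into multiple parts of similar size (usually 5)
- Training and evaluating the model several times (usually 5), each time holding out a different part for evaluation and training on the remaining ones, then averaging the scores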
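A minimal sketch of 5-fold cross-validation written with NumPy only: each part is held out once while a small linear model is fitted on the other four. The toy data and the helper names `fit_linear` and `predict_linear` are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed): y = 1 + 2x plus noise.
X = rng.uniform(-1, 1, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

def fit_linear(X, y):
    # Least-squares fit of an intercept and one coefficient per feature.
    A = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def predict_linear(coeffs, X):
    return coeffs[0] + X @ coeffs[1:]

n_folds = 5
folds = np.array_split(rng.permutation(len(X)), n_folds)

scores = []
for i in range(n_folds):
    val_idx = folds[i]                                      # held-out part
    train_idx = np.concatenate(folds[:i] + folds[i + 1:])   # remaining parts
    coeffs = fit_linear(X[train_idx], y[train_idx])
    fold_mse = np.mean((predict_linear(coeffs, X[val_idx]) - y[val_idx]) ** 2)
    scores.append(fold_mse)

print("MSE per fold:", scores)
print("Mean MSE:", np.mean(scores))
```

Every instance is used for evaluation exactly once, which gives a more stable estimate than a single train/validation split.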
4.1.7 Final model training
Once the data is preprocessed, the model selected and the hyperparameters tuned, the final model can be trained several times in order to keep the best one.
5 Challenges
5.1 Data
5.1.1 Quality
Quality of the data is obviously crucial to train well-performing models. Quality encompasses multiple aspects:
- Raw data quality: the input data must possess high enough details for the task to be even achievable. Be careful however as more features imply larger models which are longer and harder to train.
- Annotations quality: the annotations must be as precise and correct as possible in the context of the task at hand. Every blunder or outlier in a supervised dataset will slow down training and might result in unexpected behaviors of the trained model.
5.1.2 Diversity
Diversity is the most important aspect of a dataset, because ML models are good at generalizing to cases similar to those seen during training but bad at guessing in genuinely new scenarios. There are different aspects to diversity to keep in mind:
- A well-defined task is crucial to identify all the various cases that we want our model to handle. Being as exhaustive as possible when selecting the training instances will accelerate training and significantly improve the model.
- Balancing the dataset can also improve the training. When training on imbalanced datasets (i.e. when some cases are much more represented than others), the model will focus on the most represented situations as it will be the easiest and quickest way to get better results. There are ways of correcting this phenomenon, but it is always better to avoid it if possible when building the dataset.
5.1.3 Biases and fairness
Biased
Refers to a model which always makes the same kind of wrong predictions in similar cases.
In practice, a model trained on biased data will most of the time repeat the biased results. This can have major consequences and shouldn’t be underestimated: even a cold-hearted ML algorithm is not objective if it wasn’t trained on objectively chosen and annotated data.
However, there exist model architectures as well as training and evaluation methods to prevent and detect biases, which can sometimes make it possible to build unbiased models from biased data. But this needs to be carefully thought through and won’t happen unless it is explicitly planned for.
5.2 Underfitting and Overfitting
5.2.1 Definitions
Underfitting
When a model is too simple to properly extract information from a complex task. Can also be explained by key information missing in the input features.
Overfitting
When a model is too complex to properly generalize to new data. Happens often when a NN is trained too long on a dataset that is not diverse enough and learns the noise in the data.
5.2.2 Illustrations
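As a rough numerical illustration, the sketch below fits polynomials of increasing degree to a small noisy sample and compares the error on the training points with the error on held-out points. The target function, the noise level and the degrees (1, 5 and 12) are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return np.sin(3 * x)

# Small noisy training set and a larger validation set (assumed data).
x_train = rng.uniform(-1, 1, size=30)
y_train = true_function(x_train) + rng.normal(scale=0.1, size=30)
x_val = rng.uniform(-1, 1, size=200)
y_val = true_function(x_val) + rng.normal(scale=0.1, size=200)

results = {}
for degree in (1, 5, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    results[degree] = (train_mse, val_mse)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, val MSE = {val_mse:.4f}")
```

The degree-1 model is too simple and has a high error on both sets (underfitting). Raising the degree always lowers the training error, so a training error that keeps shrinking while the validation error stops improving or grows is the usual sign of overfitting.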
5.2.3 Solutions
| Solution | Underfitting | Overfitting |
|---|---|---|
| Complexity | Increase | Reduce |
| Number of features | Increase | Reduce |
| Regularization | Reduce | Increase |
| Training time | Increase | Reduce |
General strategies:
- Cross-validation to identify problems
- Grid search/random search to tune hyperparameters and balance between underfitting and overfitting
- Ensemble methods to reduce overfitting by using many smaller models instead of one large model
5.3 Interpretable and Explainable
5.3.1 Definitions
Interpretable
Qualifies an ML model whose decision-making process is straightforward and transparent, making it directly understandable by humans. This requires restricting the model complexity.
Explainable
Qualifies an ML model whose decision-making process can be partly interpreted afterwards using post hoc interpretation techniques. These techniques are often used on models which are too complex to be interpreted directly.
6 Python libraries
6.1 Data manipulation
6.1.1 Libraries
- NumPy
  - Fast numerical operations
  - Matrices with any number of dimensions (called arrays)
  - Lots of convenient operators on arrays
- Pandas
  - Can store any type of data
  - 1D or 2D tables (called Series and DataFrames respectively)
  - Lots of convenient operators on DataFrames
6.1.2 NumPy
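import numpy as np

array = np.array([[0, 1], [2, 3]])
print(array)
# Output:
# [[0 1]
#  [2 3]]

# Multiplication by a scalar is applied element-wise
print(5 * array)
# Output:
# [[ 0  5]
#  [10 15]]

# Element-wise power (np.power is available in all NumPy versions)
print(np.power(array, 3))
# Output:
# [[ 0  1]
#  [ 8 27]]

# Matrix product
print(array @ array)
# Output:
# [[ 2  3]
#  [ 6 11]]

# Element-wise selection: 10 - array where array < 2, array elsewhere
print(np.where(array < 2, 10 - array, array))
# Output:
# [[10  9]
#  [ 2  3]]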
6.1.3 Pandas
import pandas as pd
import numpy.random as npr

# A DataFrame can mix strings, floats and even arrays in its cells
df = pd.DataFrame([
    ["Pi", 3.14159, npr.randint(-100, 101, (2, 2))],
    ["Euler's number", 2.71828, npr.randint(-100, 101, (2, 2))],
    ["Golden ratio", 1.61803, npr.randint(-100, 101, (2, 2))]
], columns = ["Names", "Values", "Random numbers because why not"])
print(df)
# Output (the random numbers differ on each run):
#             Names   Values  Random numbers because why not
# 0              Pi  3.14159             [[94, 50], [1, 77]]
# 1  Euler's number  2.71828          [[-14, 69], [-82, 57]]
# 2    Golden ratio  1.61803        [[-22, -27], [79, -100]]

# Keep only the rows whose value is larger than 2
print(df[df["Values"] > 2])
# Output:
#             Names   Values  Random numbers because why not
# 0              Pi  3.14159             [[94, 50], [1, 77]]
# 1  Euler's number  2.71828          [[-14, 69], [-82, 57]]

# Keep only the rows whose name contains the letter "n"
print(df[df["Names"].str.contains("n")])
# Output:
#             Names   Values  Random numbers because why not
# 1  Euler's number  2.71828          [[-14, 69], [-82, 57]]
# 2    Golden ratio  1.61803        [[-22, -27], [79, -100]]
6.2 ML
6.2.1 SciPy
Scientific and technical computing based on NumPy (see the official SciPy documentation).
import numpy as np
import scipy as sp
import scipy.integrate  # explicit submodule imports, so that sp.integrate and
import scipy.linalg     # sp.linalg are also available on older SciPy versions

# Define a linear system Ax = b
A = np.array([[3, 2], [1, 4]])
b = np.array([6, 8])
# Solve the system
x = sp.linalg.solve(A, b)
print("Solution to the linear system Ax = b:", x)
# Output:
# Solution to the linear system Ax = b: [0.8 1.8]

# Define a function to integrate (density of the standard normal distribution)
def integrand(x):
    return np.exp(-x**2/2)/np.sqrt(2*np.pi)

# Integrate the function from 0 to infinity
result, error = sp.integrate.quad(integrand, 0, np.inf)
print("Integral of exp(-x**2/2)/sqrt(2*np.pi) from 0 to infinity:", result)
# Output:
# Integral of exp(-x**2/2)/sqrt(2*np.pi) from 0 to infinity: 0.4999999999999999
6.2.2 scikit-learn
A wide range of tools for ML (except DL); see the official scikit-learn documentation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression, load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate data, create and fit a Linear Regression model
X, y = make_regression(n_samples=100, n_features=1, noise=0.1)
model = LinearRegression()
model.fit(X, y)
print("Coefficient:", model.coef_, "Intercept:", model.intercept_)
# Output (varies between runs, as no random_state is given to make_regression):
# Coefficient: [75.62546292] Intercept: -0.008095148612447645

# Load the iris dataset and split it in training and test
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit a random forest classifier on training data, and use it
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
# Output:
# Accuracy: 1.0